N-gram language modeling of Japanese using bunsetsu boundaries

Authors

  • Sungyup Chung
  • Keikichi Hirose
  • Nobuaki Minematsu
Abstract

A new scheme of N-gram language modeling was proposed for Japanese, in which word N-grams were calculated separately for two cases: transitions that cross bunsetsu boundaries and transitions that do not. Here, a bunsetsu is a basic grammatical (and pronunciation) unit of Japanese. The authors had previously proposed a similar scheme that used accent phrase boundaries instead of bunsetsu boundaries, with some success, but it suffered from a shortage of training data, because assigning accent phrase boundaries requires a speech corpus. In contrast, bunsetsu boundaries can be detected automatically from written text with rather high accuracy using a parser. Experiments showed that perplexity could be reduced by estimating bunsetsu boundaries from a history longer than the N-1 words used in N-gram modeling and by selecting one of the two model types (crossing or not crossing bunsetsu boundaries) according to that estimate. When 1 or 3 years of the Mainichi Newspaper corpus were used to train trigrams, the proposed scheme reduced perplexity by around 8% relative to the baseline model (without separation). The proposed language model was also applied to continuous speech recognition, where it showed that the word recognition rate could be improved, especially when the training corpus was small (1 year of newspaper text).
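
The idea of keeping separate counts for boundary-crossing and non-crossing word transitions can be illustrated with a minimal sketch. This is not the authors' implementation: it uses bigrams rather than trigrams, add-one smoothing (an assumption, since the abstract does not name a smoothing method), and it treats the boundary labels as given (for example, from a parser) instead of estimating them from a longer history. The class name `BoundaryBigramModel`, the `<s>` start symbol, and the romanized toy sentences are all hypothetical.

```python
import math
from collections import defaultdict


class BoundaryBigramModel:
    """Bigram model with two count tables, selected by whether the
    transition into the current word crosses a bunsetsu boundary."""

    def __init__(self):
        # counts[crosses][previous_word][word] -> number of occurrences
        self.counts = {
            True: defaultdict(lambda: defaultdict(int)),
            False: defaultdict(lambda: defaultdict(int)),
        }
        self.vocab = set()

    def train(self, sentences):
        # Each sentence is a list of (word, starts_bunsetsu) pairs.
        # starts_bunsetsu is True when the word opens a new bunsetsu,
        # i.e. the transition from the previous word crosses a boundary.
        for sentence in sentences:
            prev = "<s>"
            for word, crosses in sentence:
                self.counts[crosses][prev][word] += 1
                self.vocab.add(word)
                prev = word

    def prob(self, prev, word, crosses):
        # Add-one smoothed estimate from the table chosen by `crosses`.
        table = self.counts[crosses][prev]
        total = sum(table.values())
        return (table.get(word, 0) + 1) / (total + len(self.vocab))

    def perplexity(self, sentences):
        # Perplexity over all words, with boundary labels assumed known.
        log_sum, n_words = 0.0, 0
        for sentence in sentences:
            prev = "<s>"
            for word, crosses in sentence:
                log_sum += math.log2(self.prob(prev, word, crosses))
                n_words += 1
                prev = word
        return 2 ** (-log_sum / n_words)


# Toy corpus: [watashi wa] [hon o] [yomu] and [watashi wa] [tegami o] [kaku]
corpus = [
    [("watashi", True), ("wa", False), ("hon", True), ("o", False), ("yomu", True)],
    [("watashi", True), ("wa", False), ("tegami", True), ("o", False), ("kaku", True)],
]

model = BoundaryBigramModel()
model.train(corpus)
print(model.prob("watashi", "wa", crosses=False))
print(model.prob("watashi", "wa", crosses=True))
print(model.perplexity(corpus))
```

In this toy corpus the particle "wa" follows "watashi" only inside a bunsetsu, so the non-crossing table assigns it a higher probability than the crossing table does. A complete model along the lines of the abstract would also need a component that estimates whether a boundary is present, which this sketch leaves out.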

Similar resources

A Stochastic Language Model using Dependency and Its Improvement by Word Clustering

In this paper, we present a stochastic language model for Japanese using dependency. The prediction unit in this model is an attribute of "bunsetsu". This is represented by the product of the head of content words and that of function words. The relation between the attributes of "bunsetsu" is ruled by a context-free grammar. The word sequences are predicted from the attribute using word n-gra...

N-gram Language Modeling of Japanese Using Prosodic Boundaries

A new method was developed to include prosodic boundary information into statistical language modeling. This method is based on counting word transitions separately for the cases crossing accent phrase boundaries and not crossing them. Since direct calculation of the above two types of word transitions requires a large speech corpus which is practically impossible to make, bi-gram counts of par...

Automatic Bunsetsu Segmentation of Japanese Sentences Using a Classification Tree

A bunsetsu, which consists of a content word followed by zero or more function words, is a convenient unit for dependency structure analysis of Japanese. There are, however, no spaces indicating bunsetsu boundaries in the orthographic writing of Japanese. Thus a sentence must be segmented into bunsetsu by some means prior to dependency structure analysis. Conventionally, such segmentation h...

Phrasal complexity and the occurrence of filled pauses in presentation speeches in Japanese

Filled pauses are ubiquitous in everyday speech. I investigated whether linguistic complexity of upcoming phrases affects filler rate at phrase boundaries in presentation speeches in Japanese. Filler rate at phrase boundaries increased monotonically with complexity of the following phrases. However, when the following phrase was composed of more than 11 Bunsetsu-phrases, the filler rate did not...

Publication date: 2004